Reinforcement Learning


Off-Policy Evaluation for Action-Dependent Non-Stationary Environments (Appendix)

Neural Information Processing Systems

A.1 How does the stationarity condition for a time-series differ from that in RL?

Conventionally, stationarity in the time-series literature refers to the condition where the distribution (or a few moments) of a finite sub-sequence of random variables in a time-series remains the same as we shift it along the time-index axis [Cox and Miller, 2017]. In contrast, the stationarity condition in the RL setting implies that the environment is fixed [Sutton and Barto, 2018]. This makes the performance J(π) of any policy π a constant value throughout. In this work, we use 'stationarity' as it is used in the RL literature.

A.2 Can the POMDP during each episode (Figure 2) itself be non-stationary?

Any source of non-stationarity can be incorporated in the (unobserved) state to induce another stationary POMDP (from which we can obtain a single sequence of interaction). The key step towards tractability is Assumption 1, which enforces additional structure on the performance of any policy across the sequence of (non-)stationary POMDPs.

A.3 What if it is known ahead of time that the non-stationarity is passive only?

A.4 How should different non-stationarities be treated in the on-policy setting?
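As a rough formalization of the contrast in A.1 (notation ours, not taken from the appendix), the two notions can be written as:

```latex
% A sketch of the two notions of stationarity contrasted in A.1 (notation ours).
\begin{align*}
&\text{Time-series (strict) stationarity: }
  (X_{t_1}, \dots, X_{t_n}) \overset{d}{=} (X_{t_1+h}, \dots, X_{t_n+h})
  \quad \forall\, n,\, t_1, \dots, t_n,\, h, \\
&\text{RL stationarity: }
  P_t(s' \mid s, a) = P(s' \mid s, a), \quad R_t(s, a) = R(s, a) \quad \forall\, t, \\
&\text{so that } J(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{t} \gamma^{t} R(S_t, A_t)\Big]
  \text{ is the same constant in every episode.}
\end{align*}
```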


Pontryagin Differentiable Programming: An End-to-End Learning and Control Framework

Neural Information Processing Systems

This paper develops a Pontryagin Differentiable Programming (PDP) methodology, which establishes a unified framework to solve a broad class of learning and control tasks. The PDP is distinguished from existing methods by two novel techniques: first, we differentiate through Pontryagin's Maximum Principle, which allows us to obtain the analytical derivative of a trajectory with respect to tunable parameters within an optimal control system, enabling end-to-end learning of dynamics, policies, and/or control objective functions; and second, we propose an auxiliary control system in the backward pass of the PDP framework, whose output is the analytical derivative of the original system's trajectory with respect to the parameters and which can be iteratively solved using standard control tools. We investigate three learning modes of the PDP: inverse reinforcement learning, system identification, and control/planning. We demonstrate the capability of the PDP in each learning mode on different high-dimensional systems, including a multi-link robot arm, a 6-DoF maneuvering quadrotor, and 6-DoF rocket powered landing.
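For intuition, the discrete-time Pontryagin conditions that such a differentiation builds on can be sketched as follows (standard optimal-control notation, ours rather than copied from the paper): for dynamics x_{t+1} = f(x_t, u_t, θ), stage cost c_t(x_t, u_t, θ), and terminal cost h(x_T, θ),

```latex
% Discrete-time Hamiltonian and Pontryagin optimality conditions (a sketch; notation ours).
\begin{align*}
H_t &= c_t(x_t, u_t, \theta) + \lambda_{t+1}^{\top} f(x_t, u_t, \theta), \\
x_{t+1} &= \frac{\partial H_t}{\partial \lambda_{t+1}}, \qquad
\lambda_t = \frac{\partial H_t}{\partial x_t}, \qquad
0 = \frac{\partial H_t}{\partial u_t}, \qquad
\lambda_T = \frac{\partial h(x_T, \theta)}{\partial x_T}.
\end{align*}
% Differentiating these conditions with respect to \theta yields a linear system in
% dx_t/d\theta and du_t/d\theta, which can be solved with standard control tools.
```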




Constrained Update Projection Approach to Safe Policy Optimization

Neural Information Processing Systems

Safe reinforcement learning (RL) studies problems where an intelligent agent has to not only maximize reward but also avoid exploring unsafe areas. In this study, we propose CUP, a novel policy optimization method based on the Constrained Update Projection framework that enjoys a rigorous safety guarantee. Central to our CUP development are the newly proposed surrogate functions along with the performance bound. Compared to previous safe reinforcement learning methods, CUP enjoys the benefit of generalizing the surrogate functions to the generalized advantage estimator (GAE), leading to strong empirical performance.
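To illustrate the general "improve, then project" pattern behind constrained update projection methods, here is a simplified NumPy sketch; it is not the paper's exact surrogate-based algorithm, and the function and argument names are ours:

```python
import numpy as np

def improve_then_project(theta, reward_grad, cost_grad, cost_value, cost_limit, lr=1e-2):
    """Two-stage update sketch: (1) take an unconstrained reward-improvement step,
    (2) if the estimated cost exceeds its limit, move the parameters back toward
    the feasible region along the cost gradient."""
    # Stage 1: reward improvement.
    theta_half = theta + lr * reward_grad
    # Stage 2: projection toward the constraint set when the constraint is violated.
    if cost_value > cost_limit:
        direction = cost_grad / (np.linalg.norm(cost_grad) + 1e-8)
        theta_half = theta_half - lr * (cost_value - cost_limit) * direction
    return theta_half
```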


Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

Neural Information Processing Systems

Recently formalized as the value equivalence principle, restricting model learning to the aspects of the environment that matter for value estimation is perhaps unavoidable, as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment whose underlying dynamics likely exceed the agent's capacity for representation. In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn and one that incurs only bounded sub-optimality.


Boosting Sample Efficiency and Generalization in Multi-agent Reinforcement Learning via Equivariance

Neural Information Processing Systems

Multi-Agent Reinforcement Learning (MARL) struggles with sample inefficiency and poor generalization [1]. These challenges are partially due to a lack of structure or inductive bias in the neural networks typically used in learning the policy. One such form of structure that is commonly observed in multi-agent scenarios is symmetry. The field of Geometric Deep Learning has developed Equivariant Graph Neural Networks (EGNN) that are equivariant (or symmetric) to rotations, translations, and reflections of nodes. Incorporating equivariance has been shown to improve learning efficiency and decrease error [2]. In this paper, we demonstrate that EGNNs improve the sample efficiency and generalization in MARL.
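For reference, a widely used E(n)-equivariant GNN layer (in the style of the EGNN literature; our rendering, not necessarily the exact architecture used in this paper) updates node features h_i and coordinates x_i as:

```latex
\begin{align*}
m_{ij} &= \phi_e\!\left(h_i^{(l)}, h_j^{(l)}, \lVert x_i^{(l)} - x_j^{(l)} \rVert^2, a_{ij}\right), \\
x_i^{(l+1)} &= x_i^{(l)} + C \sum_{j \neq i} \left(x_i^{(l)} - x_j^{(l)}\right) \phi_x(m_{ij}), \\
h_i^{(l+1)} &= \phi_h\!\Big(h_i^{(l)}, \sum_{j \neq i} m_{ij}\Big),
\end{align*}
```

which is equivariant to rotations, translations, and reflections of the node coordinates.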


Generalized Hindsight for Reinforcement Learning

Neural Information Processing Systems

One of the key reasons for the high sample complexity in reinforcement learning (RL) is the inability to transfer knowledge from one task to another. In standard multi-task RL settings, low-reward data collected while trying to solve one task provides little to no signal for solving that particular task and is hence effectively wasted. However, we argue that this data, which is uninformative for one task, is likely a rich source of information for other tasks. To leverage this insight and efficiently reuse data, we present Generalized Hindsight: an approximate inverse reinforcement learning technique for relabeling behaviors with the right tasks. Intuitively, given a behavior generated under one task, Generalized Hindsight returns a different task that the behavior is better suited for. Then, the behavior is relabeled with this new task before being used by an off-policy RL optimizer. Compared to standard relabeling techniques, Generalized Hindsight provides a substantially more efficient re-use of samples, which we empirically demonstrate on a suite of multi-task navigation and manipulation tasks.
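A minimal sketch of the relabeling step (helper names are hypothetical; the paper's actual criterion is an approximate inverse-RL score rather than the plain comparison used here):

```python
import numpy as np

def relabel_with_hindsight(trajectory, candidate_tasks, score_fn):
    """Pick the candidate task under which the collected behavior scores highest,
    then relabel the trajectory with that task before adding it to the replay
    buffer of an off-policy RL optimizer. `score_fn(trajectory, task)` is a
    hypothetical stand-in for the paper's approximate inverse-RL criterion."""
    scores = [score_fn(trajectory, task) for task in candidate_tasks]
    best_task = candidate_tasks[int(np.argmax(scores))]
    return trajectory, best_task
```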


The NetHack Learning Environment (Heinrich Küttler, Alexander H. Miller, Roberta Raileanu)

Neural Information Processing Systems

Progress in Reinforcement Learning (RL) algorithms goes hand-in-hand with the development of challenging environments that test the limits of current methods. While existing RL environments are either sufficiently complex or based on fast simulation, they are rarely both. Here, we present the NetHack Learning Environment (NLE), a scalable, procedurally generated, stochastic, rich, and challenging environment for RL research based on the popular single-player terminal-based roguelike game, NetHack. We argue that NetHack is sufficiently complex to drive long-term research on problems such as exploration, planning, skill acquisition, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience. We compare NLE and its task suite to existing alternatives, and discuss why it is an ideal medium for testing the robustness and systematic generalization of RL agents. We demonstrate empirical success for early stages of the game using a distributed Deep RL baseline and Random Network Distillation exploration, alongside qualitative analysis of various agents trained in the environment.
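A minimal interaction loop with NLE's Gym interface, assuming the `nle` package is installed and that the "NetHackScore-v0" task and the classic 4-tuple Gym step API are available (as in early NLE releases):

```python
import gym
import nle  # noqa: F401  -- importing nle registers the NetHack tasks with Gym

env = gym.make("NetHackScore-v0")
obs = env.reset()
done = False
while not done:
    # Sample random actions; a trained agent would instead act on `obs`.
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```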


569ff987c643b4bedf504efda8f786c2-AuthorFeedback.pdf

Neural Information Processing Systems

We respond to their concerns in detail below. We hope this addresses your key concern. R2: We thank you for your supportive comments, and are somewhat surprised by the low score in light of them. As per your suggestion, we will release our full research code reproducing the paper's results as an additional supplement. R4: We would like to thank you for your thorough and supportive review, and the suggestions contained therein. Figure 1, and update the link to the referenced video to include the exact time we refer to.